Pb-Hash: Partitioned b-bit Hashing
Many hashing algorithms including minwise hashing (MinHash), one permutation
hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$
bits. With $k$ hashes for each data vector, the storage would be $B \times k$
bits; and when used for large-scale learning, the model size would be
$2^B \times k$, which can be expensive. A standard strategy is to use only the
lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of
hashes. In this study, we propose to re-use the hashes by partitioning the $B$
bits into $m$ chunks, e.g., $b \times m = B$. Correspondingly, the model size
becomes $m \times 2^b \times k$, which can be substantially smaller than the
original $2^B \times k$.
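To make the partitioning step concrete, here is a minimal Python sketch (our own, not from the paper; the function `pb_hash_chunks` and its default parameters are hypothetical) that splits each $B$-bit hash into $m$ chunks of $b = B/m$ bits, so each chunk indexes a table of size $2^b$ instead of $2^B$:

```python
import numpy as np

def pb_hash_chunks(hashes, B=16, m=4):
    """Split each B-bit hash integer into m chunks of b = B/m bits.

    hashes: integer array of shape (n, k), k B-bit hashes per data vector.
    Returns an array of shape (n, k, m); each entry is a b-bit integer that
    indexes a table of size 2^b instead of 2^B, so the model size becomes
    m x 2^b x k rather than 2^B x k.
    """
    b = B // m                      # bits per chunk
    mask = (1 << b) - 1
    shifts = np.arange(m) * b       # chunk j keeps bits [j*b, (j+1)*b)
    return (hashes[..., None] >> shifts) & mask

# Example: one data vector with k = 3 hashes of B = 16 bits, m = 4 chunks.
h = np.array([[0xBEEF, 0x1234, 0xF0F0]])
print(pb_hash_chunks(h))            # all values lie in [0, 2^4 - 1]
```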
Our theoretical analysis reveals that by partitioning the hash values into $m$
chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$ bits
would not be as accurate as directly using $B$ bits. This is due to the
correlation from re-using the same hash. On the other hand, our analysis also
shows that the accuracy would not drop much for small $m$ (e.g., $m \leq 4$). In
some regions, Pb-Hash still works well even for $m$ much larger than 4. We expect
Pb-Hash would be a good addition to the family of hashing methods/applications
and benefit industrial practitioners.
We verify the effectiveness of Pb-Hash in machine learning tasks, for linear
SVM models as well as deep learning models. Since the hashed data are
essentially categorical (ID) features, we follow the standard practice of using
embedding tables for each hash. With Pb-Hash, we need to design an effective
strategy to combine the $m$ embeddings. Our study provides an empirical
evaluation of four pooling schemes: concatenation, max pooling, mean pooling,
and product pooling. There is no definitive answer as to which pooling scheme
is always better, and we leave that for future study.
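For illustration, here is a minimal sketch (our own straightforward reading of the four scheme names; `pool_embeddings` is a hypothetical helper, not the paper's code) of how the $m$ chunk embeddings could be combined:

```python
import numpy as np

def pool_embeddings(E, scheme="concat"):
    """Combine the m chunk embeddings of one hash into a single vector.

    E: array of shape (m, d), one d-dimensional embedding per chunk.
    """
    if scheme == "concat":          # shape (m * d,) -- keeps all information
        return E.reshape(-1)
    if scheme == "max":             # shape (d,), element-wise max over chunks
        return E.max(axis=0)
    if scheme == "mean":            # shape (d,), element-wise mean over chunks
        return E.mean(axis=0)
    if scheme == "product":         # shape (d,), element-wise product over chunks
        return E.prod(axis=0)
    raise ValueError(f"unknown pooling scheme: {scheme}")

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))         # m = 4 chunks, d = 8 embedding dims
vec = pool_embeddings(E, "mean")    # fed into the downstream SVM / deep model
```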
Constrained Approximate Similarity Search on Proximity Graph
Search engines and recommendation systems are built to efficiently display
relevant information selected from massive pools of candidates. Typically a
three-stage mechanism is employed in those systems: (i) a small collection of
items is first retrieved by (e.g.,) approximate near neighbor search
algorithms; (ii) then a collection of constraints is applied to the retrieved
items; (iii) a fine-grained ranking neural network is employed to determine the
final recommendation. We observe a major defect of the original three-stage
pipeline: Although we only target to retrieve $k$ vectors in the final
recommendation, we have to preset a sufficiently large $K$ ($K \gg k$) for each
query, and ``hope'' the number of surviving vectors after the filtering is not
smaller than $k$. That is, at least $k$ vectors in the $K$ similar candidates
satisfy the query constraints.
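For concreteness, a minimal sketch (ours; `search_fn` and `constraint` are hypothetical placeholders, not the paper's API) of this baseline retrieve-then-filter stage and its failure mode:

```python
def retrieve_then_filter(search_fn, query, constraint, k, K):
    """Baseline stages (i)-(ii): fetch K approximate neighbors, then drop
    the ones violating the constraint. If fewer than k survive, the whole
    search must be redone with an even larger K.

    search_fn(query, K) -> list of candidate ids (stage i, ANN search);
    constraint(item) -> bool, the user-specified filter (stage ii).
    """
    candidates = search_fn(query, K)                      # K >> k, preset blindly
    survivors = [c for c in candidates if constraint(c)]
    if len(survivors) < k:
        raise RuntimeError("preset K was too small; retry with a larger K")
    return survivors[:k]    # handed to the ranking network (stage iii)
```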
In this paper, we investigate this constrained similarity search problem and
attempt to merge the similarity search stage and the filtering stage into one
single search operation. We introduce AIRSHIP, a system that integrates
user-defined filtering functions into the similarity search framework. The
proposed system neither builds extra indices nor requires prior
knowledge of the query constraints. We propose three optimization strategies:
(1) starting point selection, (2) multi-direction search, and (3) biased
priority queue selection. Experimental evaluations on both synthetic and real
data confirm the effectiveness of the proposed AIRSHIP algorithm. We focus on
constrained graph-based approximate near neighbor (ANN) search in this study,
in part because graph-based ANN is known to achieve excellent performance. We
believe it is also possible to develop constrained hashing-based ANN or
constrained quantization-based ANN.
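To illustrate the merged search-and-filter idea, here is a generic sketch (ours, not AIRSHIP's actual algorithm; it omits the three optimizations above and all graph-construction details) of a constrained best-first search on a proximity graph:

```python
import heapq
import numpy as np

def constrained_greedy_search(graph, vectors, query, constraint, k, start=0):
    """Best-first traversal of a proximity graph that checks the
    user-defined constraint during the search itself, so no separate
    post-filtering stage (and no oversized preset K) is needed.

    graph: dict mapping node id -> list of neighbor ids;
    vectors: array of shape (n, d); constraint: node id -> bool.
    """
    visited = {start}
    frontier = [(np.linalg.norm(vectors[start] - query), start)]  # min-heap
    results = []  # max-heap via negated distances; holds <= k survivors
    while frontier:
        dist, node = heapq.heappop(frontier)
        if len(results) >= k and dist > -results[0][0]:
            break  # frontier can no longer improve the current top-k
        if constraint(node):
            heapq.heappush(results, (-dist, node))
            if len(results) > k:
                heapq.heappop(results)
        # Nodes failing the constraint are still expanded: they may
        # lead to satisfying regions elsewhere in the graph.
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier,
                               (np.linalg.norm(vectors[nb] - query), nb))
    return [n for _, n in sorted((-d, n) for d, n in results)]
```

The key design point this sketch captures is that filtering happens inside the traversal, while non-satisfying nodes still serve as stepping stones, so the search never has to over-fetch and re-filter.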